Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Generic Alert Rules #115

Merged
merged 12 commits into from
Dec 20, 2024
Merged

Enable Generic Alert Rules #115

merged 12 commits into from
Dec 20, 2024

Conversation

MichaelThamm
Copy link
Contributor

@MichaelThamm MichaelThamm commented Dec 17, 2024

Issue

Currently, anyone who wants to have up/absent alerts for their charm they would have to do this per charm.

“up” has slightly different semantics between remote-write and scrape:

  • if prometheus failed to scrape, then it’s down.
  • if grafana agent failed to to remote-write (regardles of whether scrape succeeded) then it’s absent.

To help with centralizing the up/absent rules, we need a convenient way to "inject" alerts on the fly.

Solution

Support for generic alerts can be centralised in cos-lib by adding the functionality to add rules from a string in either single-rule or official-rule format. This then allows Prometheus (and Mimir) or Grafana Agent to consume this functionality and have a generic up/absent rule in their library provided by the O11y team using best-practice knowledge. This prevents operators having to make their own up/abesnt rules and potentially not using best-practice alerting methods.

When adding rules from string a custom group name and/or prefix is acceptable which is sanitized according to Prometheus metric name format, replacing illegal characters for underscores.

Example Implementation

Given a rule from string in official-rule format (can also be single-rule format)

generic_rules = yaml.safe_load(
    textwrap.dedent(
        """
        groups:
          - name: MyHostHealth
            rules:
            - alert: HostDown
              expr: up < 1
              labels:
                severity: critical
            - alert: HostUnavailable
              expr: absent(up)
              labels:
                severity: critical
        """
    )
)

Add the rules from string to the existing rule set

alert_rules = AlertRules(query_type="promql", topology=self.topology)
alert_rules.add(generic_rules)

Context

Testing Instructions

This is an ops-independent change. See unit tests.

Upgrade Notes

@MichaelThamm MichaelThamm marked this pull request as ready for review December 17, 2024 22:41
@MichaelThamm MichaelThamm requested a review from a team as a code owner December 17, 2024 22:41
src/cosl/rules.py Outdated Show resolved Hide resolved
src/cosl/rules.py Outdated Show resolved Hide resolved
src/cosl/rules.py Outdated Show resolved Hide resolved
src/cosl/rules.py Outdated Show resolved Hide resolved
src/cosl/rules.py Outdated Show resolved Hide resolved
src/cosl/rules.py Outdated Show resolved Hide resolved
Copy link
Contributor

@dstathis dstathis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work. I am glad we are moving forward with this.

I am not sure exactly why we need to parse strings to accept up alert rules. Would it not be cleaner and easier to just express the alert rules as python objects rather than strings?

The blocking comment here is the one about the method name.

src/cosl/rules.py Outdated Show resolved Hide resolved
src/cosl/rules.py Outdated Show resolved Hide resolved
@MichaelThamm
Copy link
Contributor Author

MichaelThamm commented Dec 19, 2024

Would it not be cleaner and easier to just express the alert rules as python objects rather than strings?

The point is that we have an add method now which works the same as the add_path method of the AlertRules class where you can extend the rule groups with a new group (i.e. the generic up/absent ones) from dict.

I assume you mean to have these generic up/absent rules as an AlertRules object rather than create a new class for this "python object". IIUC then this would require a rework of the AlertRules class to add functionality to combine two AlertRules objects like alert_rules.add(AlertRules(query_type="promql", topology)).

The commodity here is a Prometheus alert YAML so it makes sense to take an input approximating this format rather than hiding it behind an object.

@MichaelThamm MichaelThamm dismissed dstathis’s stale review December 20, 2024 16:34

The blocking requested changes were accepted and included in the PR

@dstathis
Copy link
Contributor

dstathis commented Dec 20, 2024

Would it not be cleaner and easier to just express the alert rules as python objects rather than strings?

The point is that we have an add method now which works the same as the add_path method of the AlertRules class where you can extend the rule groups with a new group (i.e. the generic up/absent ones) from dict.

I assume you mean to have these generic up/absent rules as an AlertRules object rather than create a new class for this "python object". IIUC then this would require a rework of the AlertRules class to add functionality to combine two AlertRules objects like alert_rules.add(AlertRules(query_type="promql", topology)).

The commodity here is a Prometheus alert YAML so it makes sense to take an input approximating this format rather than hiding it behind an object.

I was not suggesting using some alert rule class; simply {"name": ...} instead of '{"name": ...}'. It would probably be nicer for the developer, but if there is a time crunch, strings are fine.

@MichaelThamm
Copy link
Contributor Author

I was not suggesting using some alert rule class; simply {"name": ...} instead of '{"name": ...}'. It would probably be nicer for the developer, but if there is a time crunch, strings are fine.

I like this idea, will implement it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants